cultural alignment
CulturePark: Boosting Cross-cultural Understanding in Large Language Models
Cultural bias is pervasive in many large language models (LLMs), largely due to the deficiency of data representative of different cultures. Typically, cultural datasets and benchmarks are constructed either by extracting subsets of existing datasets or by aggregating from platforms such as Wikipedia and social media. However, these approaches are highly dependent on real-world data and human annotations, making them costly and difficult to scale. Inspired by cognitive theories on social communication, this paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection. CulturePark simulates cross-cultural human communication with LLM-based agents playing roles in different cultures. It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs. Using CulturePark, we generated 41,000 cultural samples to fine-tune eight culture-specific LLMs. We evaluated these models across three downstream tasks: content moderation, cultural alignment, and cultural education. Results show that for content moderation, our GPT-3.5-based models match or outperform GPT-4.
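To make the multi-agent data-collection idea concrete, here is a minimal sketch of CulturePark-style cross-cultural dialogue generation: two culturally conditioned agents take alternating turns on a topic, and the transcript becomes candidate fine-tuning data. The persona prompts, model name, and turn count are assumptions for illustration, not the authors' actual configuration.

```python
# Minimal sketch of cross-cultural dialogue generation between two LLM agents.
# Persona prompts and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(system: str, history: list[dict]) -> str:
    """One turn from an agent defined by a cultural persona prompt."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp.choices[0].message.content

def cross_cultural_dialogue(topic: str, main_culture: str, contact_culture: str,
                            turns: int = 4) -> list[tuple[str, str]]:
    """Alternate turns between agents from two cultures; return (culture, text) pairs."""
    personas = [
        (main_culture, f"You are a person from {main_culture}. Discuss the topic from "
                       "your cultural perspective: beliefs, norms, and customs."),
        (contact_culture, f"You are a person from {contact_culture}. Respond to your "
                          "partner and compare their views with your own culture's."),
    ]
    transcript: list[tuple[str, str]] = []
    for i in range(turns):
        culture, system = personas[i % 2]
        # Re-map the shared transcript so each agent sees its own turns as "assistant".
        history = [{"role": "user", "content": f"Topic for discussion: {topic}"}]
        for speaker, text in transcript:
            history.append({"role": "assistant" if speaker == culture else "user",
                            "content": text})
        transcript.append((culture, chat(system, history)))
    return transcript
```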
Towards A Cultural Intelligence and Values Inferences Quality Benchmark for Community Values and Common Knowledge
Johnson, Brittany, Reddick, Erin, Smith, Angela D. R.
Large language models (LLMs) have emerged as a powerful technology, and thus, we have seen widespread adoption and use on software engineering teams. Most often, LLMs are designed as "general purpose" technologies meant to represent the general population. Unfortunately, this often means alignment with predominantly Western Caucasian narratives and misalignment with other cultures and populations that engage in collaborative innovation. In response to this misalignment, there have been recent efforts centered on the development of "culturally-informed" LLMs, such as ChatBlackGPT, that are capable of better aligning with historically marginalized experiences and perspectives. Despite this progress, there has been little effort aimed at supporting our ability to develop and evaluate culturally-informed LLMs. A recent effort proposed an approach for developing a national alignment benchmark that emphasizes alignment with national social values and common knowledge. However, given the range of cultural identities present in the United States (U.S.), a national alignment benchmark is an ineffective goal for broader representation. To help fill this gap in the U.S. context, we propose a replication study that translates the process used to develop KorNAT, a Korean National LLM alignment benchmark, to develop CIVIQ, a Cultural Intelligence and Values Inference Quality benchmark centered on alignment with community social values and common knowledge. Our work provides a critical foundation for research and development aimed at cultural alignment of AI technologies in practice.
- North America > United States > District of Columbia > Washington (0.15)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
- Questionnaire & Opinion Survey (1.00)
- Research Report (0.82)
- Law (0.68)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.68)
- Health & Medicine (0.47)
- Education (0.47)
Cultural Prompting Improves the Empathy and Cultural Responsiveness of GPT-Generated Therapy Responses
Xie, Serena Jinchen, Zhai, Shumenghui, Liang, Yanjing, Li, Jingyi, Fan, Xuehong, Cohen, Trevor, Yuwen, Weichao
Large Language Model (LLM)-based conversational agents offer promising solutions for mental health support, but lack cultural responsiveness for diverse populations. This study evaluated the effectiveness of cultural prompting in improving cultural responsiveness and perceived empathy of LLM-generated therapeutic responses for Chinese American family caregivers. Using a randomized controlled experiment, we compared GPT-4o and DeepSeek-V3 responses with and without cultural prompting. Thirty-six participants evaluated input-response pairs on cultural responsiveness (competence and relevance) and perceived empathy. Results showed that cultural prompting significantly enhanced GPT-4o's performance across all dimensions, with GPT-4o with cultural prompting being the most preferred, while improvements in DeepSeek-V3 responses were not significant. Mediation analysis revealed that cultural prompting improved empathy through improving cultural responsiveness. This study demonstrated that prompt-based techniques can effectively enhance the cultural responsiveness of LLM-generated therapeutic responses, highlighting the importance of cultural responsiveness in delivering empathetic AI-based therapeutic interventions to culturally and linguistically diverse populations.
- North America > United States > Washington > Pierce County > Tacoma (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
- Asia > Southeast Asia (0.04)
- (2 more...)
- Research Report > Strength High (1.00)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
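As a concrete illustration of what "cultural prompting" can look like in practice, the sketch below generates paired responses with and without a culture-specifying system prompt, as in the study's two conditions. The prompt wording and model name are assumptions, not the study's actual materials.

```python
# Sketch of a cultural-prompting comparison; prompt text and model name are
# illustrative assumptions, not the study's materials.
from openai import OpenAI

client = OpenAI()

CULTURAL_PROMPT = (
    "You are a therapist responding to a Chinese American family caregiver. "
    "Be mindful of values such as filial piety, family obligation, and stigma "
    "around seeking mental health support, and respond with culturally "
    "appropriate empathy."
)

def respond(caregiver_message: str, cultural: bool) -> str:
    """Return a therapeutic response, with or without the cultural system prompt."""
    messages = []
    if cultural:
        messages.append({"role": "system", "content": CULTURAL_PROMPT})
    messages.append({"role": "user", "content": caregiver_message})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

# Paired conditions for the same input, e.g. for a blinded rating study:
msg = "I feel guilty whenever I take a break from caring for my mother."
baseline, culturally_prompted = respond(msg, cultural=False), respond(msg, cultural=True)
```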
FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
Fatehkia, Masoomali, Altinisik, Enes, Sencar, Husrev Taha
Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants. To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1k norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Middle East (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
- (2 more...)
- Law (1.00)
- Health & Medicine > Therapeutic Area (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
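A hedged sketch of the LLM-judge labeling stage described in the abstract: each prompt-response pair is scored by a panel of judge models on harmlessness and cultural awareness, and the averaged scores can serve as training targets for the filter. The rubric text, 0-1 scale, and judge interface are assumptions, not FanarGuard's actual annotation protocol.

```python
# Sketch of panel-of-LLM-judges scoring for moderation-filter training data.
# Each judge is any callable that takes a query string and returns a JSON string.
import json
import statistics
from typing import Callable

RUBRIC = (
    "Rate the RESPONSE to the PROMPT on two axes, each from 0.0 to 1.0:\n"
    "harmlessness (1.0 = fully safe) and cultural_awareness (1.0 = fully aligned "
    "with Arabic cultural norms). Reply with JSON: "
    '{"harmlessness": x, "cultural_awareness": y}'
)

def score_pair(prompt: str, response: str,
               judges: list[Callable[[str], str]]) -> dict:
    """Average the scores returned by a panel of judge models for one pair."""
    query = f"{RUBRIC}\n\nPROMPT:\n{prompt}\n\nRESPONSE:\n{response}"
    harm, culture = [], []
    for judge in judges:
        scores = json.loads(judge(query))  # each judge returns a JSON string
        harm.append(float(scores["harmlessness"]))
        culture.append(float(scores["cultural_awareness"]))
    return {
        "prompt": prompt,
        "response": response,
        "harmlessness": statistics.mean(harm),
        "cultural_awareness": statistics.mean(culture),
    }
```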
Reasoning Shapes Alignment: Investigating Cultural Alignment in Large Reasoning Models with Cultural Norms
Wang, Yuhang, Zhu, Yanxu, Sang, Jitao
The advanced reasoning capabilities of Large Reasoning Models enable them to thoroughly understand and apply safety policies through deliberate thought processes, thereby improving the models' safety. Beyond safety, these models must also be able to reflect the diverse range of human values across various cultures. This paper presents the Cultural Norm-based Cultural Alignment (CNCA) framework, which enables models to leverage their powerful reasoning ability to align with cultural norms. Specifically, we propose three methods to automatically mine cultural norms from limited survey data and explore ways to effectively utilize these norms for improving cultural alignment. Two alignment paradigms are examined: an in-context alignment method, where cultural norms are explicitly integrated into the user context, and a fine-tuning-based method, which internalizes norms through enhanced Chain-of-Thought training data. Comprehensive experiments demonstrate the effectiveness of these methods, highlighting that models with stronger reasoning capabilities benefit more from cultural norm mining and utilization. Our findings emphasize the potential for reasoning models to better reflect diverse human values through culturally informed alignment strategies.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Health & Medicine > Therapeutic Area (0.47)
- Health & Medicine > Consumer Health (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
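To make the in-context alignment paradigm concrete, here is a minimal sketch that turns survey-derived norms into statements and prepends them to the context before querying a model; the norm phrasing, the toy mining step, and the model call are assumptions rather than the CNCA implementation.

```python
# Sketch of in-context cultural alignment: mined norms are injected into the
# system context. The example mining heuristic and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def mine_norms(survey_rows: list[dict], culture: str) -> list[str]:
    """Toy stand-in for norm mining: turn majority survey answers into norm statements."""
    norms = []
    for row in survey_rows:
        if row["culture"] == culture and row["agreement"] >= 0.6:
            norms.append(f"In {culture}, most people believe: {row['statement']}")
    return norms

def ask(question: str, norms: list[str]) -> str:
    """Answer a question while conditioning on the mined cultural norms."""
    system = ("Reason step by step and answer in a way consistent with these "
              "cultural norms:\n- " + "\n- ".join(norms))
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper targets large reasoning models
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```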
Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
Abdalla, M. H. I., Wang, Zhipin, Frey, Christian, Eger, Steffen, Grabocka, Josif
Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.

Large Language Models (LLMs) have transformed Natural Language Processing (NLP), Computer Vision (CV), and machine learning (ML) more broadly. They achieve state-of-the-art performance in text generation and comprehension across diverse domains, including code synthesis (Rozière et al., 2023), mathematical reasoning (Ahn et al., 2024), scientific writing (Geng et al., 2025; Eger et al., 2025), multimodal tasks such as text-image understanding and generation (Alayrac et al., 2022), and evaluation of machine translation and related tasks (Gu et al., 2025). This success stems from scaling to millions and billions of parameters. However, this scaling requires large computational resources, motivating the search for parameter-efficient fine-tuning (PEFT) techniques. Recent advances have made it possible to adapt LLMs to task-specific criteria, which is crucial for broader applicability and acceptance of NLP systems. A recent stream of research leverages PEFT techniques (Ding et al., 2023; Weyssow et al., 2023; Prottasha et al., 2024), e.g., Low-Rank Adaptation (LoRA) (Hu et al., 2021), to adapt an LLM toward desired task-specific values. LoRA achieves this by freezing most of the pre-trained model's parameters and introducing trainable low-rank matrices, yielding weight correction terms. However, stand-alone LoRA approaches are primarily tailored for single-task adaptation and may lose their effectiveness when an LLM needs to be adapted to various downstream settings.
- Europe > Austria > Vienna (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Brazil (0.05)
- (49 more...)
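The abstract describes generating LoRA adapters from textual descriptions via a factorized hypernetwork. The PyTorch sketch below shows one plausible reading of that idea: a small hypernetwork maps a condition embedding to per-layer, per-rank scales applied to shared LoRA bases, keeping the parameter count low. The factorization scheme, dimensions, and naming are assumptions, not Zhyper's actual architecture.

```python
# Plausible sketch of a factorized hypernetwork conditioning LoRA adapters on a
# text embedding. Shapes and the factorization scheme are assumptions.
import torch
import torch.nn as nn

class FactorizedLoRAHyperNet(nn.Module):
    def __init__(self, cond_dim: int, hidden: int, n_layers: int, rank: int,
                 d_in: int, d_out: int):
        super().__init__()
        # Shared low-rank bases, reused across all conditions (keeps parameters small).
        self.A = nn.Parameter(torch.randn(n_layers, rank, d_in) * 0.02)
        self.B = nn.Parameter(torch.zeros(n_layers, d_out, rank))
        # Hypernetwork maps a condition embedding to per-layer, per-rank scales.
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_layers * rank),
        )
        self.n_layers, self.rank = n_layers, rank

    def forward(self, cond_emb: torch.Tensor) -> list[torch.Tensor]:
        """cond_emb: (cond_dim,) embedding of a textual description.
        Returns one weight-correction matrix (d_out, d_in) per adapted layer."""
        scales = self.hyper(cond_emb).view(self.n_layers, self.rank)
        deltas = []
        for layer in range(self.n_layers):
            scaled_A = scales[layer].unsqueeze(-1) * self.A[layer]  # (rank, d_in)
            deltas.append(self.B[layer] @ scaled_A)                 # (d_out, d_in)
        return deltas

# Usage: embed a cultural description with any text encoder, then get adapters.
hyper = FactorizedLoRAHyperNet(cond_dim=384, hidden=128, n_layers=4, rank=8,
                               d_in=768, d_out=768)
cond = torch.randn(384)       # placeholder for an encoded description
lora_deltas = hyper(cond)     # add each delta to the matching frozen weight matrix
```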
I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs
Zahraei, Pardis Sadat, Asgari, Ehsaneddin
We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs) with respect to the beliefs and values of the Middle East and North Africa (MENA) region, an underrepresented area in current AI evaluation efforts. Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries. To probe LLM behavior, we evaluate diverse models across multiple conditions formed by crossing three perspective framings (neutral, personalized, and third-person/cultural observer) with two language modes (English and localized native languages: Arabic, Persian, Turkish). Our analysis reveals three critical phenomena: "Cross-Lingual Value Shifts" where identical questions yield drastically different responses based on language, "Reasoning-Induced Degradation" where prompting models to explain their reasoning worsens cultural alignment, and "Logit Leakage" where models refuse sensitive questions while internal probabilities reveal strong hidden preferences. We further demonstrate that models collapse into simplistic linguistic categories when operating in native languages, treating diverse nations as monolithic entities. MENAValues offers a scalable framework for diagnosing cultural misalignment, providing both empirical insights and methodological tools for developing more culturally inclusive AI.
- Africa > North Africa (0.24)
- Africa > Sudan (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
- (29 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Education (0.68)
- Government (0.46)
- Banking & Finance > Economy (0.45)
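As a small illustration of how the evaluation conditions above can be enumerated (three perspective framings crossed with two language modes), the sketch below builds the condition grid and formats one survey item per condition; the framing templates are assumptions, not the benchmark's exact wording.

```python
# Sketch of the 3 framings x 2 language modes condition grid; templates are illustrative.
from itertools import product

FRAMINGS = {
    "neutral": "{question}",
    "personalized": "You are a person living in {country}. {question}",
    "observer": "As an outside observer describing people in {country}, answer: {question}",
}
LANGUAGE_MODES = ["english", "native"]  # native = Arabic, Persian, or Turkish, per country

def build_conditions(question_en: str, question_native: str, country: str) -> list[dict]:
    """Return the six prompts (3 framings x 2 languages) for one survey item."""
    conditions = []
    for (framing, template), lang in product(FRAMINGS.items(), LANGUAGE_MODES):
        question = question_en if lang == "english" else question_native
        conditions.append({
            "framing": framing,
            "language": lang,
            "prompt": template.format(question=question, country=country),
        })
    return conditions
```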
'Too much alignment; not enough culture': Re-balancing cultural alignment practices in LLMs
Orlowski, Eric J. W., Norhashim, Hakim, Wey, Tristan Koh Ly
While cultural alignment has increasingly become a focal point within AI research, current approaches relying predominantly on quantitative benchmarks and simplistic proxies fail to capture the deeply nuanced and context-dependent nature of human cultures. Existing alignment practices typically reduce culture to static demographic categories or superficial cultural facts, thereby sidestepping critical questions about what it truly means to be culturally aligned. This paper argues for a fundamental shift towards integrating interpretive qualitative approaches drawn from social sciences into AI alignment practices, specifically in the context of Large Language Models (LLMs). Drawing inspiration from Clifford Geertz's concept of "thick description," we propose that AI systems must produce outputs that reflect deeper cultural meanings--what we term "thick outputs"--grounded firmly in user-provided context and intent. We outline three necessary conditions for successful cultural alignment: sufficiently scoped cultural representations, the capacity for nuanced outputs, and the anchoring of outputs in the cultural contexts implied within prompts. Finally, we call for cross-disciplinary collaboration and the adoption of qualitative, ethnographic evaluation methods as vital steps toward developing AI systems that are genuinely culturally sensitive, ethically responsible, and reflective of human complexity.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Singapore (0.04)
Culture is Everywhere: A Call for Intentionally Cultural Evaluation
Oh, Juhyun, Cha, Inha, Saxon, Michael, Lim, Hyunseung, Bhatt, Shaily, Oh, Alice
The prevailing "trivia-centered paradigm" for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly "neutral" evaluation settings. In this position paper, we argue for intentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize the what, how, and circumstances by which culturally contingent considerations arise in evaluation, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don't know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.
- Europe > Austria > Vienna (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (9 more...)
- Overview (0.68)
- Research Report (0.64)
- Questionnaire & Opinion Survey (0.46)
Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation
Mukherjee, Arka, Ghosh, Shreya
As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments. Cross-modal evaluation shows that culturally distinct outputs are indeed detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), yet visual-cultural understanding remains limited. In essence, we establish the promise and challenges of cultural competence in multimodal AI.
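A rough sketch of the identity-perturbation step and the cross-modal check described above: the same story request is issued with different nationality cues, and story embeddings are compared against image embeddings to test whether culturally distinct outputs are detectable. The prompt wording, the `embed_text` callable, and the pre-computed image embeddings are placeholders, and the retrieval rate below is a simplified proxy for the paper's within-/cross-nationality recall, not the authors' code.

```python
# Sketch of cultural-identity perturbation for VLM story generation, plus a
# simplified proxy for the within- vs cross-nationality retrieval check.
from typing import Callable
import numpy as np

def perturbed_prompts(nationalities: list[str]) -> dict[str, str]:
    """Same story request with different cultural identity cues (assumed wording)."""
    return {n: f"Write a short story about the family in this image. The family is {n}."
            for n in nationalities}

def same_nationality_retrieval_rate(stories: dict[str, list[str]],
                                    image_embs: dict[str, list[np.ndarray]],
                                    embed_text: Callable[[str], np.ndarray]) -> float:
    """Fraction of generated stories whose nearest image embedding shares their
    nationality (a simplified proxy for within-/cross-nationality recall)."""
    items = [(nat, v / np.linalg.norm(v))
             for nat, vecs in image_embs.items() for v in vecs]
    hits, total = 0, 0
    for nat, texts in stories.items():
        for text in texts:
            t = embed_text(text)
            t = t / np.linalg.norm(t)
            best_nat = items[int(np.argmax([t @ v for _, v in items]))][0]
            hits += int(best_nat == nat)
            total += 1
    return hits / total
```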